Course Lead: Dr James G. Shanahan (email Jimi via James.Shanahan AT gmail.com)
Name: Your Name Goes Here
Class: MIDS w261 (Section Your Section Goes Here, e.g., Fall 2016 Group 1)
Email: Your UC Berkeley Email Goes Here@iSchool.Berkeley.edu
Week: 5
Due Time: 2 Phases.
HW5 Phase 1: This can be done on a local machine (with a unit test on the cloud, such as AltaScale's PaaS or AWS) and is due Tuesday of Week 6 by 8 AM (West Coast time). It focuses primarily on building unit/systems tests for a pairwise similarity calculation pipeline (over stripe documents).
HW5 Phase 2: This will require the AltaScale cluster and is due Tuesday of Week 7 by 8 AM (West Coast time). The focus of HW5 Phase 2 is to scale the unit/systems tests up to the Google 5-gram corpus. This will be a group exercise.
MIDS UC Berkeley, Machine Learning at Scale DATSCIW261 ASSIGNMENT #5
Version 2016-09-25
=== INSTRUCTIONS for SUBMISSIONS === Follow the instructions for submissions carefully.
https://docs.google.com/forms/d/1ZOr9RnIe_A06AcZDB6K1mJN4vrLeSmS2PD6Xm3eOiis/viewform?usp=send_form
HW5 Phase 1 can be completed locally on your computer
In [ ]:
In [ ]:
Using MRJob, implement a hash-side join (memory-backed map-side join) for left, right, and inner joins. Use the following tables for this HW and join on the country code (the third column of the transactions table and the second column of the Countries table):
transactions.dat
Alice Bob|$10|US
Sam Sneed|$1|CA
Jon Sneed|$20|CA
Arnold Wesise|$400|UK
Henry Bob|$2|US
Yo Yo Ma|$2|CA
Jon York|$44|CA
Alex Ball|$5|UK
Jim Davis|$66|JA
Countries.dat
United States|US
Canada|CA
United Kingdom|UK
Italy|IT
Justify which table you chose as the left table in this hash-side join.
Please report the number of rows resulting from each of the left, right, and inner joins.
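A minimal sketch of the memory-backed map-side pattern (not the reference solution): it assumes the small Countries.dat file is shipped to every mapper with --file, so it can be loaded into an in-memory dictionary in mapper_init, while transactions.dat is the streamed input. The left/right/inner variants differ only in which unmatched rows get emitted.
In [ ]:
from mrjob.job import MRJob

class MRHashJoin(MRJob):

    def mapper_init(self):
        # Build the in-memory hash table from the small table (Countries.dat),
        # which --file places in the task's working directory.
        self.countries = {}
        with open('Countries.dat') as f:
            for line in f:
                name, code = line.strip().split('|')
                self.countries[code] = name

    def mapper(self, _, line):
        # Each streamed record is a transaction: name|amount|country_code
        name, amount, code = line.strip().split('|')
        country = self.countries.get(code)   # None when the code has no match (e.g., JA)
        if country is not None:
            # inner join: emit only matching rows
            yield code, [name, amount, country]
        # left join: also yield code, [name, amount, None] when country is None
        # right join: additionally emit countries whose codes never matched
        # (e.g., track matched codes and flush the rest in mapper_final)

if __name__ == '__main__':
    MRHashJoin.run()
In [ ]:
# Example invocation (local), assuming the job above is saved as hash_join.py:
!python hash_join.py --file Countries.dat transactions.dat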
In [ ]:
The following MRJob code implements a reduce-side inner join. The reducer is almost stateless, i.e., it uses as little memory as possible. Use the tables from HW5.2 and join on the country code (the third column of the transactions table and the second column of the Countries table). Perform left, right, and inner joins using the code provided below and report the number of rows resulting from each.
Again, make smart decisions about which table should be the left table (i.e., cross-check the code).
Some notes on the code: the mapper receives its input splits from either the transactions table or the countries table and makes the appropriate transformations: splitting the line into fields and emitting a key/value pair. The key is the join key, in this case the country code field of both sets of records, and it acts as the partitioning key. The mapper knows which file (and therefore which type of record) it is receiving based on the number of fields. We use the SORT_VALUES option, which ensures the values are sorted as well. Then we employ a trick to ensure that, for each join key, country records are always seen before transaction records: we prepend an arbitrary symbol to the value, 'A' for countries and 'B' for transactions, which makes countries sort before transactions for each and every join/partition key. After that trick, the join is simply a matter of buffering the country ('A') records and crossing that array with each transaction record.
In [ ]:
import sys, os, re
from mrjob.job import MRJob

class MRJoin(MRJob):

    # Performs secondary sort
    SORT_VALUES = True

    def mapper(self, _, line):
        splits = line.rstrip("\n").split("|")

        if len(splits) == 2:  # country data
            symbol = 'A'      # make country sort before transaction data
            country2digit = splits[1]
            yield country2digit, [symbol, splits]
        else:                 # person data
            symbol = 'B'
            country2digit = splits[2]
            yield country2digit, [symbol, splits]

    def reducer(self, key, values):
        countries = []  # should come first, as they are sorted on artificial key 'A'
        for value in values:
            if value[0] == 'A':
                countries.append(value)
            if value[0] == 'B':
                for country in countries:
                    yield key, country[1:] + value[1:]

if __name__ == '__main__':
    MRJoin.run()
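For local testing, both tables can be passed as input at once (the mapper tells them apart by the number of fields), e.g., assuming the code above is saved as reduce_side_join.py:
In [ ]:
!python reduce_side_join.py transactions.dat Countries.dat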
A large subset of the Google n-grams dataset
https://aws.amazon.com/datasets/google-books-ngrams/
which we have placed in a bucket/folder on Dropbox and on s3:
https://www.dropbox.com/sh/tmqpc4o0xswhkvz/AACUifrl6wrMrlK6a3X3lZ9Ea?dl=0
s3://filtered-5grams/
In particular, this bucket contains ~200 files (about 10 MB each) in the format:
(ngram) \t (count) \t (pages_count) \t (books_count)
The next cell shows the first 10 lines of the googlebooks-eng-all-5gram-20090715-0-filtered.txt file.
DISCLAIMER: Each record is already a 5-gram. Ideally we would calculate the co-occurrence stripes from the raw text rather than from the preprocessed 5-gram data; computing pairs on these 5-grams is slightly corrupt because overlapping 5-grams double-count co-occurrences. Having said that, this exercise can still pull out some similar terms.
In [ ]:
%%writefile googlebooks-eng-all-5gram-20090715-0-filtered-first-10-lines.txt
A BILL FOR ESTABLISHING RELIGIOUS 59 59 54
A Biography of General George 92 90 74
A Case Study in Government 102 102 78
A Case Study of Female 447 447 327
A Case Study of Limited 55 55 43
A Child's Christmas in Wales 1099 1061 866
A Circumstantial Narrative of the 62 62 50
A City by the Sea 62 60 49
A Collection of Fairy Tales 123 117 80
A Collection of Forms of 116 103 82
For HW 5.4-5.5, unit test and regression test your code using the following small test datasets:
In [ ]:
%%writefile atlas-boon-systems-test.txt
atlas boon 50 50 50
boon cava dipped 10 10 10
atlas dipped 15 15 15
In [ ]:
############################################
# Stripes for systems test 1 (predefined)
############################################
with open("mini_stripes.txt", "w") as f:
f.writelines([
'"DocA"\t{"X":20, "Y":30, "Z":5}\n',
'"DocB"\t{"X":100, "Y":20}\n',
'"DocC"\t{"M":5, "N":20, "Z":5, "Y":1}\n'
])
!cat mini_stripes
In [4]:
### Major steps to consider in your solution
In [ ]:
###############################################
# Make Stripes from ngrams for systems test 2
###############################################
!aws s3 rm --recursive s3://ucb261-hw5/hw5-4-stripes-mj
!python buildStripes.py -r emr mj_systems_test.txt \
--cluster-id=j-2WHMJSLZDGOY5 \
--output-dir=s3://ucb261-hw5/hw5-4-stripes-mj \
--file=stopwords.txt \
--file=mostFrequent/part-00000 \
# Output suppressed
In [ ]:
In [ ]:
#Using the atlas-boon systems test
atlas boon 50 50 50
boon cava dipped 10 10 10
atlas dipped 15 15 15
In [ ]:
###############################################
# Make Stripes from ngrams
###############################################
!aws s3 rm --recursive s3://ucb261-hw5/hw5-4-stripes-mj
!python buildStripes.py -r emr mj_systems_test.txt \
--cluster-id=j-2WHMJSLZDG \
--output-dir=s3://ucb261-hw5/hw5-4-stripes-mj \
--file=stopwords.txt \
--file=mostFrequent/part-00000 \
# Output suppressed
In [ ]:
!mkdir stripes-mj
!aws s3 sync s3://ucb261-hw5/hw5-4-stripes-mj/ stripes-mj/
!cat stripes-mj/part-*
In [ ]:
"atlas" {"dipped": 15, "boon": 50}
"boon" {"atlas": 50, "dipped": 10, "cava": 10}
"cava" {"dipped": 10, "boon": 10}
"dipped" {"atlas": 15, "boon": 10, "cava": 10}
Solution 1:
In [ ]:
##########################################################
# Pretty print systems tests for generating Inverted Index
##########################################################
import json

print "Systems test 1 - Inverted Index"
print "—"*100
with open("systems_test_index","r") as f:
    lines = f.readlines()
    for line in lines:
        line = line.strip()
        word, stripe = line.split("\t")
        stripe = json.loads(stripe)
        stripe.extend([["",""] for _ in xrange(3 - len(stripe))])
        print "{0:>15} |{1:>15} |{2:>15} |{3:>15}".format(
            (word), stripe[0][0]+" "+str(stripe[0][1]), stripe[1][0]+" "+str(stripe[1][1]), stripe[2][0]+" "+str(stripe[2][1]))

print "\nSystems test 2 - Inverted Index"
print "—"*100
with open("mj_index","r") as f:
    lines = f.readlines()
    for line in lines:
        line = line.strip()
        word, stripe = line.split("\t")
        stripe = json.loads(stripe)
        stripe.extend([["",""] for _ in xrange(3 - len(stripe))])
        print "{0:>15} |{1:>15} |{2:>15} |{3:>15}".format(
            (word), stripe[0][0]+" "+str(stripe[0][1]), stripe[1][0]+" "+str(stripe[1][1]), stripe[2][0]+" "+str(stripe[2][1]))
In [ ]:
Systems test mini_stripes - Inverted Index
————————————————————————————————————————————————————————————————————————————————————————————————————
"M" | DocC 4 | |
"N" | DocC 4 | |
"X" | DocA 3 | DocB 2 |
"Y" | DocA 3 | DocB 2 | DocC 4
"Z" | DocA 3 | DocC 4 |
systems test atlas-boon - Inverted Index
————————————————————————————————————————————————————————————————————————————————————————————————————
"atlas" | boon 3 | dipped 3 |
"dipped" | atlas 2 | boon 3 | cava 2
"boon" | atlas 2 | cava 2 | dipped 3
"cava" | boon 3 | dipped 3 |
In [ ]:
Systems test mini_stripes - Similarity measures
——————————————————————————————————————————————————————————————————————————————————————————————————————————————
average | pair | cosine | jaccard | overlap | dice
--------------------------------------------------------------------------------------------------------------
0.741582 | DocA - DocB | 0.816497 | 0.666667 | 1.000000 | 0.800000
0.488675 | DocA - DocC | 0.577350 | 0.400000 | 0.666667 | 0.571429
0.276777 | DocB - DocC | 0.353553 | 0.200000 | 0.500000 | 0.333333
--------------------------------------------------------------------------------------------------------------
Systems test atlas-boon 2 - Similarity measures
——————————————————————————————————————————————————————————————————————————————————————————————————————————————
average | pair | cosine | jaccard | overlap | dice
--------------------------------------------------------------------------------------------------------------
1.000000 | atlas - cava | 1.000000 | 1.000000 | 1.000000 | 1.000000
0.625000 | boon - dipped | 0.666667 | 0.500000 | 0.666667 | 0.666667
0.389562 | cava - dipped | 0.408248 | 0.250000 | 0.500000 | 0.400000
0.389562 | boon - cava | 0.408248 | 0.250000 | 0.500000 | 0.400000
0.389562 | atlas - dipped | 0.408248 | 0.250000 | 0.500000 | 0.400000
0.389562 | atlas - boon | 0.408248 | 0.250000 | 0.500000 | 0.400000
--------------------------------------------------------------------------------------------------------------
Once you are happy with your test results proceed to generating your results on the Google n-grams dataset.
Do some EDA on this dataset using mrjob, e.g.,
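For example, a single MRJob pass can report basic corpus statistics such as the number of 5-gram records, the summed occurrence count, and the length of the longest n-gram. A sketch, assuming the tab-separated format shown above:
In [ ]:
from mrjob.job import MRJob

class NgramEDA(MRJob):

    def mapper(self, _, line):
        ngram, count, pages, books = line.strip().split("\t")
        yield "records", 1
        yield "total_count", int(count)
        yield "longest_ngram_chars", len(ngram)

    def reducer(self, key, values):
        # max for the length statistic, sum for the count statistics
        if key == "longest_ngram_chars":
            yield key, max(values)
        else:
            yield key, sum(values)

if __name__ == '__main__':
    NgramEDA.run()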
In [ ]:
In [ ]:
In [ ]:
There is also a corpus of stopwords, that is, high-frequency words like "the", "to" and "also" that we sometimes want to filter out of a document before further processing. Stopwords usually have little lexical content, and their presence in a text fails to distinguish it from other texts. Python's nltk comes with a prebuilt list of stopwords (see below). Using this stopword list, filter these tokens out of your analysis, rerun the experiments in HW 5.5, and discuss the results with and without the stopword list.
from nltk.corpus import stopwords
stopwords.words('english')
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
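A minimal sketch of how the filtering might look, assuming the stopword list has been written one lowercase word per line to stopwords.txt (the file already shipped to the stripes job with --file=stopwords.txt) rather than importing nltk inside every task:
In [ ]:
# Load the stopword list once (e.g., in mapper_init) and drop those tokens
# before building stripes.
with open("stopwords.txt") as f:
    STOPWORDS = set(w.strip() for w in f if w.strip())

def filter_tokens(ngram):
    """Return the tokens of an n-gram with stopwords removed."""
    return [w for w in ngram.lower().split() if w not in STOPWORDS]

print filter_tokens("A Child's Christmas in Wales")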
For each of HW 5.4-5.5.1: please unit test and system test your code with respect to the SYSTEMS TEST DATASET and show the results. Please compute the expected answer by hand and show your hand calculations for the SYSTEMS TEST DATASET, then show the results you get with your system.
In this part of the assignment we will focus on developing methods for detecting synonyms, using the Google 5-grams dataset. At a high level:
To accomplish this you must script two main tasks using MRJob:
TASK (1) Build stripes for the most frequent 10,000 words, using co-occurrence information based on the words ranked 9001-10,000 as a basis/vocabulary (drop stopword-like terms), and output to a file in your bucket on S3 (a bigram analysis, though the words are non-contiguous).
TASK (2) Using two (symmetric) comparison methods of your choice (e.g., correlations, distances, similarities), pairwise compare all stripes (vectors), and output to a file in your bucket on s3.
For this task you will be able to modify the pattern we used in HW 3.2 (feel free to use that solution as a reference). To total the word counts across the 5-grams, output the support from the mappers using the order inversion pattern:
<*word,count>
to ensure that the support arrives before the cooccurrences.
In addition to the total word counts, the mapper must also output co-occurrence counts for the pairs of words inside each 5-gram. Treat these words as a basket, as we did in HW 3, but count stripes or pairs in both orderings, i.e., both (word1,word2) and (word2,word1), to preserve symmetry in our output for TASK (2).
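A sketch of just this emission pattern (a hypothetical helper, not the reference mapper): for each word of a 5-gram it yields the special <*word,count> support record followed by the symmetric co-occurrence records; with SORT_VALUES enabled in the job, the '*' records sort ahead of the word records at each reducer key.
In [ ]:
def emit_support_and_pairs(ngram_line):
    """Yield (key, value) records for one 5-gram line in the order-inversion style."""
    ngram, count, pages, books = ngram_line.strip().split("\t")
    count = int(count)
    words = ngram.lower().split()
    for w in words:
        yield w, ["*", count]            # <*word,count>: the word's total support
        for other in words:              # co-occurrences, emitted in both orders
            if other != w:
                yield w, [other, count]

for record in emit_support_and_pairs("atlas boon\t50\t50\t50"):
    print record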
For this task you will have to determine a method of comparison. Here are a few that you might consider:
However, be cautioned that some comparison methods are more difficult to parallelize than others; also, do not compute more associations than necessary, since your choice of association will be symmetric.
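For reference, one common formulation (which reproduces the systems-test numbers shown above) treats each stripe as the set of words it contains; the measures can also be computed on weighted vectors:
$$cosine(A,B) = \frac{|A \cap B|}{\sqrt{|A|\,|B|}} \qquad jaccard(A,B) = \frac{|A \cap B|}{|A \cup B|}$$
$$overlap(A,B) = \frac{|A \cap B|}{\min(|A|,|B|)} \qquad dice(A,B) = \frac{2\,|A \cap B|}{|A| + |B|}$$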
Please use the inverted index (discussed in live session #5) based pattern to compute the pairwise (term-by-term) similarity matrix.
Please report the size of the cluster used and the amount of time it takes to run for the index construction task and for the synonym calculation task. How many pairs need to be processed (HINT: use the posting list length to calculate directly)? Report your Cluster configuration!
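On the hint: if $|P_t|$ is the length of the posting list for term $t$ in the inverted index, then the number of pairwise comparisons generated (before any aggregation of pairs that share several terms) is
$$\#pairs \;=\; \sum_{t}\binom{|P_t|}{2} \;=\; \sum_{t}\frac{|P_t|\,(|P_t|-1)}{2}$$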
In [ ]:
In [ ]:
print "\nTop/Bottom 20 results - Similarity measures - sorted by cosine"
print "(From the entire data set)"
print '—'*117
print "{0:>30} |{1:>15} |{2:>15} |{3:>15} |{4:>15} |{5:>15}".format(
"pair", "cosine", "jaccard", "overlap", "dice", "average")
print '-'*117
for stripe in sortedSims[:20]:
print "{0:>30} |{1:>15f} |{2:>15f} |{3:>15f} |{4:>15f} |{5:>15f}".format(
stripe[0], float(stripe[1]), float(stripe[2]), float(stripe[3]), float(stripe[4]), float(stripe[5]) )
print '—'*117
for stripe in sortedSims[-20:]:
print "{0:>30} |{1:>15f} |{2:>15f} |{3:>15f} |{4:>15f} |{5:>15f}".format(
stripe[0], float(stripe[1]), float(stripe[2]), float(stripe[3]), float(stripe[4]), float(stripe[5]) )
In [ ]:
Top/Bottom 20 results - Similarity measures - sorted by cosine
(From the entire data set)
—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————
pair | cosine | jaccard | overlap | dice | average
---------------------------------------------------------------------------------------------------------------------
cons - pros | 0.894427 | 0.800000 | 1.000000 | 0.888889 | 0.895829
forties - twenties | 0.816497 | 0.666667 | 1.000000 | 0.800000 | 0.820791
own - time | 0.809510 | 0.670563 | 0.921168 | 0.802799 | 0.801010
little - time | 0.784197 | 0.630621 | 0.926101 | 0.773473 | 0.778598
found - time | 0.783434 | 0.636364 | 0.883788 | 0.777778 | 0.770341
nova - scotia | 0.774597 | 0.600000 | 1.000000 | 0.750000 | 0.781149
hong - kong | 0.769800 | 0.615385 | 0.888889 | 0.761905 | 0.758995
life - time | 0.769666 | 0.608789 | 0.925081 | 0.756829 | 0.765091
time - world | 0.755476 | 0.585049 | 0.937500 | 0.738209 | 0.754058
means - time | 0.752181 | 0.587117 | 0.902597 | 0.739854 | 0.745437
form - time | 0.749943 | 0.588418 | 0.876733 | 0.740885 | 0.738995
infarction - myocardial | 0.748331 | 0.560000 | 1.000000 | 0.717949 | 0.756570
people - time | 0.745788 | 0.573577 | 0.923875 | 0.729010 | 0.743063
angeles - los | 0.745499 | 0.586207 | 0.850000 | 0.739130 | 0.730209
little - own | 0.739343 | 0.585834 | 0.767296 | 0.738834 | 0.707827
life - own | 0.737053 | 0.582217 | 0.778502 | 0.735951 | 0.708430
anterior - posterior | 0.733388 | 0.576471 | 0.790323 | 0.731343 | 0.707881
power - time | 0.719611 | 0.533623 | 0.933586 | 0.695898 | 0.720680
dearly - install | 0.707107 | 0.500000 | 1.000000 | 0.666667 | 0.718443
found - own | 0.704802 | 0.544134 | 0.710949 | 0.704776 | 0.666165
—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————
arrival - essential | 0.008258 | 0.004098 | 0.009615 | 0.008163 | 0.007534
governments - surface | 0.008251 | 0.003534 | 0.014706 | 0.007042 | 0.008383
king - lesions | 0.008178 | 0.003106 | 0.017857 | 0.006192 | 0.008833
clinical - stood | 0.008178 | 0.003831 | 0.011905 | 0.007634 | 0.007887
till - validity | 0.008172 | 0.003367 | 0.015625 | 0.006711 | 0.008469
evidence - started | 0.008159 | 0.003802 | 0.012048 | 0.007576 | 0.007896
forces - record | 0.008152 | 0.003876 | 0.011364 | 0.007722 | 0.007778
primary - stone | 0.008146 | 0.004065 | 0.009091 | 0.008097 | 0.007350
beneath - federal | 0.008134 | 0.004082 | 0.008403 | 0.008130 | 0.007187
factors - rose | 0.008113 | 0.004032 | 0.009346 | 0.008032 | 0.007381
evening - functions | 0.008069 | 0.004049 | 0.008333 | 0.008065 | 0.007129
bone - told | 0.008061 | 0.003704 | 0.012346 | 0.007380 | 0.007873
building - occurs | 0.008002 | 0.003891 | 0.010309 | 0.007752 | 0.007489
company - fig | 0.007913 | 0.003257 | 0.015152 | 0.006494 | 0.008204
chronic - north | 0.007803 | 0.003268 | 0.014493 | 0.006515 | 0.008020
evaluation - king | 0.007650 | 0.003030 | 0.015625 | 0.006042 | 0.008087
resulting - stood | 0.007650 | 0.003663 | 0.010417 | 0.007299 | 0.007257
agent - round | 0.007515 | 0.003289 | 0.012821 | 0.006557 | 0.007546
afterwards - analysis | 0.007387 | 0.003521 | 0.010204 | 0.007018 | 0.007032
posterior - spirit | 0.007156 | 0.002660 | 0.016129 | 0.005305 | 0.007812
In this part of the assignment you will evaluate the success of your synonym detector (developed in response to HW5.4). Take the top 1,000 closest/most similar/most correlated pairs of words, as determined by your measure in HW5.4, and use the synonyms function in the accompanying Python code:
nltk_synonyms.py
Note: This will require installing the python nltk package:
http://www.nltk.org/install.html
and downloading its data with nltk.download().
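A quick sanity check that the WordNet data is in place (nltk.download('wordnet') fetches just the corpus needed here; calling nltk.download() with no argument opens the interactive downloader instead):
In [ ]:
import nltk
nltk.download('wordnet')                 # one-time download of the WordNet corpus
from nltk.corpus import wordnet as wn

# a few lemma names WordNet treats as synonymous with "car"
print sorted(set(name for s in wn.synsets('car') for name in s.lemma_names()))[:10]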
For each (word1,word2) pair, check to see if word1 is in the list, synonyms(word2), and vice-versa. If one of the two is a synonym of the other, then consider this pair a 'hit', and then report the precision, recall, and F1 measure of your detector across your 1,000 best guesses. Report the macro averages of these measures.
$$Precision (P) = \frac{TP}{TP + FP} $$
$$Recall (R) = \frac{TP}{TP + FN} $$
$$F1 = \frac{2 * ( precision * recall )}{precision + recall}$$
We calculate precision by counting the number of hits and dividing by the number of occurrences in our top 1,000 (opportunities).
We calculate recall by counting the number of hits and dividing by the number of synonyms in WordNet (syns).
Other diagnostic measures not implemented here: https://en.wikipedia.org/wiki/F1_score#Diagnostic_Testing
In [ ]:
''' Performance measures '''
from __future__ import division
import numpy as np
import json
import nltk
from nltk.corpus import wordnet as wn
import sys

# return all the synset lemma names (synonyms) of a word
def synonyms(string):
    syndict = {}
    for i,j in enumerate(wn.synsets(string)):
        syns = j.lemma_names()
        for syn in syns:
            syndict.setdefault(syn,1)
    return syndict.keys()

hits = []
TP = 0
FP = 0
TOTAL = 0
flag = False  # so we don't double count, but at the same time don't miss hits

## For this part we can use one of three outputs. They are all the same, but were generated differently
# 1. the top 1000 from the full sorted dataset          -> sortedSims[:1000]
# 2. the top 1000 from the partial sort aggregate file  -> sims2/top1000sims
# 3. the top 1000 from the total order sort file        -> head -1000 sims_parts/part-00004
top1000sims = []
with open("sims2/top1000sims","r") as f:
    for line in f.readlines():
        line = line.strip()
        avg, lisst = line.split("\t")
        lisst = json.loads(lisst)
        lisst.append(avg)
        top1000sims.append(lisst)

measures = {}
not_in_wordnet = []

for line in top1000sims:
    TOTAL += 1
    pair = line[0]
    words = pair.split(" - ")
    for word in words:
        if word not in measures:
            measures[word] = {"syns":0, "opps":0, "hits":0}
        measures[word]["opps"] += 1

    syns0 = synonyms(words[0])
    measures[words[1]]["syns"] = len(syns0)
    if len(syns0) == 0:
        not_in_wordnet.append(words[0])
    if words[1] in syns0:
        TP += 1
        hits.append(line)
        flag = True
        measures[words[1]]["hits"] += 1

    syns1 = synonyms(words[1])
    measures[words[0]]["syns"] = len(syns1)
    if len(syns1) == 0:
        not_in_wordnet.append(words[1])
    if words[0] in syns1:
        if flag == False:
            TP += 1
            hits.append(line)
        measures[words[0]]["hits"] += 1
    flag = False

precision = []
recall = []
f1 = []
for key in measures:
    p,r,f = 0,0,0
    if measures[key]["hits"] > 0 and measures[key]["syns"] > 0:
        p = measures[key]["hits"]/measures[key]["opps"]
        r = measures[key]["hits"]/measures[key]["syns"]
        f = 2 * (p*r)/(p+r)
    # For calculating measures, only take into account words that have synonyms in wordnet
    if measures[key]["syns"] > 0:
        precision.append(p)
        recall.append(r)
        f1.append(f)

# Take the mean of each measure
print "—"*110
print "Number of Hits:",TP, "out of top",TOTAL
print "Number of words without synonyms:",len(not_in_wordnet)
print "—"*110
print "Precision\t", np.mean(precision)
print "Recall\t\t", np.mean(recall)
print "F1\t\t", np.mean(f1)
print "—"*110
print "Words without synonyms:"
print "-"*100
for word in not_in_wordnet:
    print synonyms(word),word
In [ ]:
——————————————————————————————————————————————————————————————————————————————————————————————————————————————
Number of Hits: 31 out of top 1000
Number of words without synonyms: 67
——————————————————————————————————————————————————————————————————————————————————————————————————————————————
Precision 0.0280214404967
Recall 0.0178598869579
F1 0.013965517619
——————————————————————————————————————————————————————————————————————————————————————————————————————————————
Words without synonyms:
----------------------------------------------------------------------------------------------------
[] scotia
[] hong
[] kong
[] angeles
[] los
[] nor
[] themselves
[]
.......
Repeat HW5 using vocabulary words ranked from 8001-10,000; 7001-10,000; 6001-10,000; 5001-10,000; 3001-10,000; and 1001-10,000. Don't forget to report your cluster configuration.
Generate the following graphs:
-- vocabulary size (X-axis) versus CPU time for indexing
-- vocabulary size (X-axis) versus number of pairs processed
-- vocabulary size (X-axis) versus F1 measure, Precision, and Recall
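A minimal plotting sketch (matplotlib; the metric lists are placeholders to be filled in from your own runs, and the vocabulary sizes correspond to the ranges listed above):
In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline

# vocabulary sizes implied by the ranges 9001-10,000 ... 1001-10,000
vocab_sizes = [1000, 2000, 3000, 4000, 5000, 7000, 9000]

# fill these in from your cluster runs
cpu_time_indexing = []       # seconds per indexing run
pairs_processed = []         # from the posting-list calculation above
precision, recall, f1 = [], [], []

fig, axes = plt.subplots(1, 3, figsize=(18, 4))
axes[0].plot(vocab_sizes[:len(cpu_time_indexing)], cpu_time_indexing, marker='o')
axes[0].set_xlabel('vocabulary size')
axes[0].set_ylabel('CPU time for indexing (s)')
axes[1].plot(vocab_sizes[:len(pairs_processed)], pairs_processed, marker='o')
axes[1].set_xlabel('vocabulary size')
axes[1].set_ylabel('number of pairs processed')
for series, label in [(f1, 'F1'), (precision, 'Precision'), (recall, 'Recall')]:
    axes[2].plot(vocab_sizes[:len(series)], series, marker='o', label=label)
axes[2].set_xlabel('vocabulary size')
axes[2].legend()
plt.show()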
In [ ]:
There is also a corpus of stopwords, that is, high-frequency words like "the", "to" and "also" that we sometimes want to filter out of a document before further processing. Stopwords usually have little lexical content, and their presence in a text fails to distinguish it from other texts. Python's nltk comes with a prebuilt list of stopwords (see below). Using this stopword list, filter these tokens out of your analysis, rerun the experiments in HW 5.5, and discuss the results with and without the stopword list.
from nltk.corpus import stopwords
stopwords.words('english')
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
In [ ]:
There are many good ways to build synonym detectors. For this optional homework, measure co-occurrence using only (left/right/all) consecutive words inside the 5-grams, or make stripes according to word co-occurrences with the accompanying 2-, 3-, or 4-grams (note that in the latter case your output will no longer be interpretable as a network).
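For example, only the pair-emission loop inside the mapper needs to change. A sketch of the consecutive-word variants (a hypothetical helper; `words` holds the tokens of one 5-gram):
In [ ]:
def consecutive_pairs(words, mode="all"):
    """Yield (w1, w2) pairs using only consecutive words of one 5-gram.
    mode: 'left' = preceding neighbour only, 'right' = following neighbour only,
    'all' = both neighbours."""
    n = len(words)
    for i, w in enumerate(words):
        if mode in ("left", "all") and i > 0:
            yield w, words[i - 1]
        if mode in ("right", "all") and i < n - 1:
            yield w, words[i + 1]

print list(consecutive_pairs("atlas boon cava".split(), mode="left"))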
In [ ]:
In [ ]: